Skip to content

Conversation

@1zg12
Copy link

@1zg12 1zg12 commented Aug 31, 2022

What changes were proposed in this pull request?

If the target Java data class has a circular reference, Spark will fail fast from creating the Dataset or running Encoders.

This PR will add an option for developer to decide whether they would like to skip the circular field, or leave the application to fail.

Why are the changes needed?

If the target Java data class has a circular reference, Spark will fail fast from creating the Dataset or running Encoders.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Issue: https://issues.apache.org/jira/browse/SPARK-33598

@github-actions github-actions bot added the SQL label Aug 31, 2022
@1zg12 1zg12 force-pushed the master branch 6 times, most recently from 428d510 to 8eb5b06 Compare August 31, 2022 11:38
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Aug 31, 2022

Hm, skipping them doesn't seem right either. Not sure if this should be an option; it is just something that doesn't make sense to encode

@1zg12
Copy link
Author

1zg12 commented Sep 1, 2022

Hm, skipping them doesn't seem right either. Not sure if this should be an option; it is just something that doesn't make sense to encode

If it's a field the developer/application is comfortable with having self/circular reference, from Spark perspective I think it should allow the developer to stop the loop gracefully (which is to skip further processing the field in loop at developers' own judgement).

This PR is not to force for either way (stop the whole application immediately as existing or skip the filed if the developer choose to), but leave for the developer a choice to choose. Ultimately, it's the developer building their own application have best knowledge how to handle it.

I guess Spark probably assumed the circular reference must be a mistake made by the developers/application earlier. But it can really be a valid case even it could be rare.

@srowen
Copy link
Member

srowen commented Sep 1, 2022

Can you describe a valid use case? I can't think of one. Encoders are used with data classes, bean-like classes

@1zg12
Copy link
Author

1zg12 commented Sep 1, 2022

Can you describe a valid use case? I can't think of one. Encoders are used with data classes, bean-like classes

Google Protobuf is an example, it's widely used as a data class. In the protobuf class, there is an attribute called Descriptor (this is a generated filed) which circular reference back.

Current spark implementation doesn't work with protobuf.

There are some other examples on the issue:
https://issues.apache.org/jira/browse/SPARK-33598

@jkhalid
Copy link

jkhalid commented Nov 28, 2022

Hey All ,
Is there any update on this PR. I am currently running into an issue where I have a JSON schema that has circular references I am using Encoders.bean(classOf[Example.class]) to map the input to the POJO generated from that schema .it fails because of the circular reference

@srowen srowen closed this Dec 5, 2022
@venkyvb
Copy link

venkyvb commented Dec 20, 2022

Hey all,
Wondering if this PR (or some similar fix got merged). I have similar issues with circular references and it would be great to have an option to skip the check.
PS: My example is not related to protobuf but is related to the POJO classes generated from the XSD schemas for FHIR - https://hl7.org/fhir/downloads.html
Thanks.

@lsgrep
Copy link

lsgrep commented Mar 17, 2023

Hi, I am having this circular reference problem while processing the Kafka avro messages with Spark 3.3.0.

Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema

https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java

@lsgrep
Copy link

lsgrep commented Mar 17, 2023

Hi @srowen , would you consider supporting avro schemas as a valid reason for supporting this feature as avro is pretty popular in general? Thanks

@srowen
Copy link
Member

srowen commented Mar 17, 2023

Still seems weird to me --
Does this happen to even be 'enough' for the protobuf case? Or does this extra unwanted descriptor field add other unneeded cols?
Is it 'too much' - Is it skipping real circular references that matter, but can't translate to tabular schemas?
Is it solvable by just subclassing the bean class and hiding the field that isn't desirable to begin with?
How does the circular ref arise in the avro case, different?

I get it just seems a bit too hacky as the 'right' solution. Sometimes hacks are worth it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants